Matching graphs generated from source code

ABSTRACT

Techniques are described herein for training a machine learning model and using the trained machine learning model to more accurately determine alignments between matching/corresponding nodes of predecessor and successor graphs representing predecessor and successor source code snippets. A method includes: obtaining a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determining a mapping across the first and second abstract syntax trees; obtaining a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; aligning blocks in the first control-flow graph with blocks in the second control-flow graph; and applying the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.

BACKGROUND

One way to identify changes made to a piece or “snippet” of source code is to generate graphs that represent the source code snippet before and after editing. These graphs may represent what will be referred to herein as “predecessor” and “successor” source code snippets. The predecessor source code snippet may be the source code snippet of interest prior to some changes being made to it, and may be represented by a first graph. The successor source code snippet may be the same source code snippet after the changes have been made, and may be represented by a second graph. In some cases, a change graph may be determined from the first and second graphs, and may represent the changes made to the source code snippet. Each graph may take various forms, such as an abstract syntax tree (AST), a control flow graph (CFG), etc. Heuristics exist to map matching nodes of the (predecessor) first graph to nodes of the second (successor) graph. However, these heuristics tend to be somewhat inaccurate, which in turn can cause downstream operations that rely on the mappings to be inaccurate as well.

SUMMARY

Implementations are described herein for generating an alignment of nodes in a first graph representing a predecessor source code snippet and a second graph representing a successor source code snippet. In various implementations, a machine learning model such as a graph neural network may be trained to generate an alignment of nodes across abstract syntax trees using, as inputs, aligned blocks in control-flow graphs representing the predecessor and successor source code snippets. This alignment may be used for purposes such as generating a change graph that represents one or more edits made to the predecessor source code snippet to yield the successor source code snippet.

In some implementations, the first control-flow graph and the second control-flow graph may be obtained using a parser to generate the first control-flow graph from the predecessor source code snippet and the second control-flow graph from the successor source code snippet, and the blocks in the control-flow graphs may be aligned based on a mapping across the first abstract syntax tree and the second abstract syntax tree. The mapping may be generated using a tree-based code differencing algorithm.

In another aspect, a synthetic training dataset may be generated using a first abstract syntax tree that represents a source code snippet. In various implementations, a second abstract syntax tree may be generated from the first abstract syntax tree. In some implementations, field values in various nodes in the second abstract syntax tree may be changed. One or more nodes may be deleted in the first abstract syntax tree and/or the second abstract syntax tree. Additionally, a parent node may be changed for one or more nodes in the second abstract syntax tree. In various implementations, a machine learning model may be trained to generate an alignment of nodes based on the first abstract syntax tree and the second abstract syntax tree.

In various implementations, a method implemented by one or more processors may include: obtaining a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determining a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes; obtaining a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; aligning blocks in the first control-flow graph with blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree; and applying the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.

In some implementations, the trained machine learning model may include a graph neural network. In some implementations, the determining the mapping across the first abstract syntax tree and the second abstract syntax tree may include using a tree-based code differencing algorithm. In some implementations, the obtaining the first control-flow graph and the second control-flow graph may include using a parser to generate the first control-flow graph from the predecessor source code snippet and the second control-flow graph from the successor source code snippet.

In some implementations, the aligning the blocks in the first control-flow graph with the blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree may include: determining a mapping across the first control-flow graph and the second control-flow graph between pairs of similar blocks identified using the mapping across the first abstract syntax tree and the second abstract syntax tree; and using the mapping across the first control-flow graph and the second control-flow graph to align the blocks in the first control-flow graph with the blocks in the second control-flow graph.

In some implementations, in the applying the aligned blocks as inputs across the trained machine learning model to generate the alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree, candidate node alignments may be constrained based on nodes in the aligned blocks.

In some implementations, a change graph may be generated based on the alignment of the nodes in the first abstract syntax tree with the nodes in the second abstract syntax tree. The change graph may represent one or more edits made to the predecessor source code snippet to yield the successor source code snippet.

In some additional or alternative implementations, a computer program product may include one or more non-transitory computer-readable storage media having program instructions collectively stored on the one or more non-transitory computer-readable storage media. The program instructions may be executable to: obtain a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determine a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes; obtain a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; align blocks in the first control-flow graph with blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree; and apply the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.

In some implementations, the program instructions may be further executable to generate a change graph based on the alignment of the nodes in the first abstract syntax tree with the nodes in the second abstract syntax tree. The change graph may represent one or more edits made to the predecessor source code snippet to yield the successor source code snippet.

In some additional or alternative implementations, a system may include a processor, a computer-readable memory, one or more computer-readable storage media, and program instructions collectively stored on the one or more computer-readable storage media. The program instructions may be executable to: obtain a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determine a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes; obtain a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; align blocks in the first control-flow graph with blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree; and apply the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.

It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an example environment in which selected aspects of the present disclosure may be implemented, in accordance with various implementations.

FIG. 2 is a block diagram showing an example of how one or more machine learning models trained using techniques described herein may be used to make inferences, in accordance with various implementations.

FIG. 3 is a block diagram showing an example of how a synthetic training dataset may be generated and used to train one or more machine learning models to make inferences, in accordance with various implementations.

FIG. 4 depicts a flowchart illustrating an example method for using one or more machine learning models trained using techniques described herein to make inferences, in accordance with various implementations.

FIG. 5 depicts a flowchart illustrating an example method for generating a synthetic training dataset and using the synthetic training dataset to train one or more machine learning models, in accordance with various implementations.

FIG. 6 illustrates an example architecture of a computing device.

DETAILED DESCRIPTION

FIG. 1 depicts an example environment 100 in which selected aspects of the present disclosure may be implemented, in accordance with various implementations. Any computing devices depicted in FIG. 1 or elsewhere in the figures may include logic such as one or more microprocessors (e.g., central processing units or “CPUs”, graphics processing units or “GPUs”) that execute computer-readable instructions stored in memory, or other types of logic such as application-specific integrated circuits (“ASIC”), field-programmable gate arrays (“FPGA”), and so forth. Some of the systems depicted in FIG. 1, such as a code knowledge system 102, may be implemented using one or more server computing devices that form what is sometimes referred to as a “cloud infrastructure,” although this is not required.

Code knowledge system 102 may be configured to perform selected aspects of the present disclosure in order to help one or more clients 110 _(1-P) make various inferences based on implemented and/or potential changes to the clients' respective source code bases 112 _(1-P). For example, code knowledge system 102 may be configured to determine alignments of nodes between graphs representing predecessor source code snippets and graphs representing successor source code snippets associated with source code bases 112 _(1-P) of clients 110 _(1-P).

These alignments may then be used for a variety of different purposes, such as to generate a change graph for use as input to other downstream source code predictions, such as to predict code change intents (e.g., change log entries), comments to be embedded into source code, identification of coding mistakes, etc. These change graphs can also be used in other contexts to train other types of machine learning models. For example, a machine learning model such as a graph neural network (GNN) may be trained using change graphs generated as described herein to predict code changes, e.g., during a large-scale source code update and/or migration.

Each client 110 may be, for example, an entity or organization such as a business (e.g., financial institution, bank, etc.), non-profit, club, university, government agency, or any other organization that operates one or more software systems. For example, a bank may operate one or more software systems to manage the money under its control, including tracking deposits and withdrawals, tracking loans, tracking investments, and so forth. An airline may operate one or more software systems for booking/canceling/rebooking flight reservations, managing delays or cancellations of flights, managing people associated with flights, such as passengers, air crews, and ground crews, managing airport gates, and so forth.

Many of these entities' code bases 112 may be highly complex, requiring teams of programmers and/or software engineers to perform code base migrations, maintenance, and/or updates. Many of these personnel may be under considerable pressure, and may place low priority on tasks that might be considered “menial” or expendable, such as composing descriptive and/or helpful code change intents, in embedded comments or as part of change list entries. Moreover, a mass code update or migration may require myriad small changes to source code at numerous locations, further challenging these personnel.

Accordingly, code knowledge system 102 may be configured to leverage knowledge of past changes made to source code, such as during code base migration, update, or maintenance events, in order to automate tasks such as composition and/or summarization of code change intents and/or comments, to predict code changes, etc. Many of these tasks may rely on the ability to accurately and quickly identify changes made to source code. Although it is possible to perform text comparisons to determine textual changes between different versions of source code, these textual changes may not convey structural relationships embodied in the source code, e.g., between different logical branches, statements, variables, etc.

Source code, and changes to source code, can also be represented in graph form. For example, source code may be converted into an abstract syntax tree (AST) and/or control flow graph (CFG), either of which may maintain not only the syntax of the code, but also the underlying structure. A change graph can be generated based on graphs representing a source code snippet before (predecessor) and after (successor) the source code snippet is changed.

Conventional heuristics for determining an alignment between matching/corresponding nodes of the predecessor and successor graphs in order to generate a change graph may have limited accuracy. Many of the code changes may be minor, may be relatively hard to discern in similar contexts, and/or may be incomplete and/or semantically incorrect. Accordingly, code knowledge system 102 is configured with selected aspects of the present disclosure to leverage machine learning to more accurately determine alignments between matching/corresponding nodes of predecessor and successor graphs representing predecessor and successor source code snippets.

As noted above, alignments between matching nodes in general, and change graphs generated therefrom in particular, may have a variety of uses. As one example, with change graphs generated using techniques described herein, a machine learning model such as a GNN may be trained to predict code changes, e.g., to automate at least part of a widespread source code update and/or migration. As another example, change graphs generated using techniques described herein may be processed using a machine learning model such as a GNN to automatically predict and/or compose code change intents. Code change intents may be embodied in various forms, such as in change list entries that are sometimes required when an updated source code snippet is committed (e.g., installed, stored, incorporated) into a code base, in comments (e.g., delimited with symbols such as “//” or “#”) embedded in the source code, in change logs, or anywhere else where human-composed language indicating an intent behind a source code change might be found.

In either case (predicting code changes or the intents behind them), labeled pairs of predecessor/successor source code snippets may be used to generate corresponding pairs of graphs (e.g., ASTs, CFGs). These graph pairs may be processed with a machine learning model such as a GNN to generate an embedding in vector space. Techniques such as triplet loss may then be used to train the machine learning model based on the embedding's relative proximity in the latent space to other embeddings having similar and dissimilar labels. Labels used for code change prediction and labels used for code change intent prediction may be similar, identical, or entirely different from each other.
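As a minimal sketch of the triplet loss mentioned above, the following computes the loss for one anchor embedding, one embedding with a similar label (positive), and one with a dissimilar label (negative). The function names, the use of Euclidean distance, and the margin value are illustrative assumptions and are not taken from this disclosure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hypothetical helper: zero once the positive is closer to the anchor
    than the negative by at least `margin`, positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# The similar-label embedding is already much closer, so nothing is penalized.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 5.0])
# The dissimilar-label embedding is closer than the similar one, so the loss is positive.
poorly_separated = triplet_loss([0.0, 0.0], [0.0, 3.0], [0.0, 1.0])
```

Minimizing this quantity over many triplets would pull embeddings with similar labels together and push embeddings with dissimilar labels apart.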

Subsequently, to predict a code change, a source code snippet to-be-updated may be converted into graph form and embedded into the vector space using the trained machine learning model. Various nearest neighbor search algorithms may then be used to identify proximate embeddings that represent previous code changes made during previous migrations. These previous code changes may be considered as candidate edits for the source code snippet to-be-updated. Similarly, to predict a code change intent, predecessor and successor source code snippets may be converted into graph form and embedded into the same vector space or a different vector space using the trained machine learning model. Various nearest neighbor search algorithms may then be used to identify proximate embeddings that represent previous code changes made during previous migrations, as well as code change intents behind those changes.
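The nearest neighbor lookup described above might be sketched as a brute-force search over stored embeddings. The index contents, the labels, and the choice of cosine similarity are assumptions made purely for illustration; a practical system would more likely use an approximate nearest neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Return the labels of the k stored embeddings most similar to `query`."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical embeddings of previous code changes, keyed by a descriptive label.
previous_changes = {
    "rename_variable": [1.0, 0.0],
    "add_null_check": [0.0, 1.0],
    "rename_method": [0.9, 0.1],
}
candidates = nearest_neighbors([1.0, 0.05], previous_changes, k=2)
```

Here the two rename-style changes are returned as candidate edits because their embeddings point in nearly the same direction as the query.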

In various implementations, code knowledge system 102 may include a machine learning (“ML” in FIG. 1) database 104 that includes data indicative of one or more trained machine learning models 106 _(1-N). These machine learning models 106 _(1-N) may take various forms that will be described in more detail below, including but not limited to a GNN, a graph matching network (GMN), a sequence-to-sequence model such as various flavors of a recurrent neural network (e.g., long short-term memory, or “LSTM”, gated recurrent units, or “GRU”, etc.) or an encoder-decoder, and any other type of machine learning model that may be applied to facilitate selected aspects of the present disclosure. In some implementations, code knowledge system 102 may also have access to one or more code base(s) 108. In some implementations, the code bases 108 may be used, for instance, to train one or more of the machine learning models 106 _(1-N).

In various implementations, a client 110 that wishes to take advantage of techniques described herein to, for example, predict and/or implement code changes and/or code change intents when migrating, updating, or even maintaining its code base 112 may establish a relationship with an entity (not depicted in FIG. 1) that hosts code knowledge system 102. In some implementations, code knowledge system 102 may then process all or parts of the client's source code base 112, e.g., by interfacing with the client's software development version control system (not depicted) over one or more networks 114 such as the Internet. Based on this processing, code knowledge system 102 may perform various techniques described herein for predicting and/or utilizing code changes and/or the intents behind them. In other implementations, e.g., where the client's code base 112 is massive, one or more representatives of the entity that hosts code knowledge system 102 may travel to the client's site(s) to perform updates and/or make recommendations.

FIG. 2 is a block diagram showing an example process flow 200 in which code knowledge system 102 may use one or more machine learning models 106 _(1-N) trained using techniques described herein to make inferences, in accordance with various implementations. Various components depicted in FIG. 2 may be implemented by code knowledge system 102 or separately from code knowledge system 102. These components may be implemented using any combination of hardware and computer-readable instructions.

Beginning at the top left, in implementations, a predecessor source code snippet 205 and a successor source code snippet 210 may be processed by a “code-to-AST” component 215 to generate, respectively, first AST 220 and second AST 225. In other implementations, source code snippets 205, 210 may be converted into other types of graphs.

In implementations, the first AST 220 and the second AST 225 may be processed by a mapping module 230 to generate an AST mapping 235. The mapping module 230 may generate, as the AST mapping 235, a mapping across the first AST 220 and the second AST 225 using, e.g., a tree-based code differencing algorithm. In implementations, the AST mapping 235 may be a mapping between pairs of similar nodes in the first AST 220 and the second AST 225.
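To make the idea of an AST mapping concrete, the sketch below parses two Python snippets with the standard `ast` module and pairs nodes that have the same type and occupy the same position in both trees. This is a deliberately naive stand-in for a tree-based code differencing algorithm, not the algorithm used by mapping module 230; the function name `map_asts` and the choice of Python as the snippet language are assumptions.

```python
import ast


def map_asts(first, second):
    """Pair same-typed nodes found at matching positions in two ASTs.

    A real tree differencing algorithm would also handle moved, inserted,
    and deleted subtrees; this sketch stops descending at the first mismatch.
    """
    pairs = []

    def visit(a, b):
        if type(a) is not type(b):
            return
        pairs.append((a, b))
        for child_a, child_b in zip(ast.iter_child_nodes(a), ast.iter_child_nodes(b)):
            visit(child_a, child_b)

    visit(first, second)
    return pairs


predecessor = ast.parse("x = 1\nprint(x)\n")
successor = ast.parse("y = 1\nprint(y)\n")
mapping = map_asts(predecessor, successor)
```

With these inputs, the `Name` node for `x` in the predecessor is paired with the `Name` node for `y` in the successor, because the rename changes a field value but not the tree shape.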

In implementations, the predecessor source code snippet 205 and the successor source code snippet 210 may also be processed by a “code-to-CFG” component 240 to generate, respectively, first CFG 245 and second CFG 250. In implementations, the code-to-CFG component 240 may use a parser to generate the first CFG 245 from the predecessor source code snippet 205 and the second CFG 250 from the successor source code snippet 210.

In implementations, the first CFG 245 and the second CFG 250 may be processed by a block alignment module 255 to generate aligned blocks 260. The block alignment module 255 may align blocks in the first CFG 245 with blocks in the second CFG 250 based on the AST mapping 235. In implementations, the block alignment module 255 may determine a mapping across the first CFG 245 and the second CFG 250 between pairs of similar blocks identified using the AST mapping 235 across the first AST 220 and the second AST 225. The block alignment module 255 may then use the mapping across the first CFG 245 and the second CFG 250 to generate aligned blocks 260 in which the blocks in the first CFG 245 are aligned with the blocks in the second CFG 250.
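One plausible way to derive a block mapping from an AST mapping is to let mapped nodes vote: each pair of blocks scores one point for every AST node in the first block whose mapped counterpart lies in the second block, and the highest-scoring pairs are accepted greedily. The disclosure does not specify how similar blocks are identified, so the voting scheme, the data layout (blocks as sets of node identifiers), and every name below are assumptions.

```python
from collections import Counter


def align_blocks(first_blocks, second_blocks, ast_mapping):
    """Greedy one-to-one block alignment driven by AST node matches.

    first_blocks / second_blocks: {block_name: set of AST node ids}
    ast_mapping: {first_node_id: second_node_id}
    """
    # Which successor block owns each successor AST node.
    owner = {node: name for name, nodes in second_blocks.items() for node in nodes}

    votes = Counter()
    for name, nodes in first_blocks.items():
        for node in nodes:
            target = owner.get(ast_mapping.get(node))
            if target is not None:
                votes[(name, target)] += 1

    aligned, used_second = {}, set()
    for (first_name, second_name), _ in votes.most_common():
        if first_name not in aligned and second_name not in used_second:
            aligned[first_name] = second_name
            used_second.add(second_name)
    return aligned


first_cfg = {"A0": {"a1", "a2"}, "A1": {"a3"}}
second_cfg = {"B0": {"b1", "b2"}, "B1": {"b3", "b4"}}
node_mapping = {"a1": "b1", "a2": "b2", "a3": "b3"}
aligned_blocks = align_blocks(first_cfg, second_cfg, node_mapping)
```

Node `b4` has no counterpart in the predecessor, so it contributes no votes, yet its block is still aligned through `a3` and `b3`.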

In implementations, the first AST 220, the second AST 225, and the aligned blocks 260 may be applied by an alignment module 265 as inputs across a machine learning model 106 _(1-N) (e.g., a GNN) to generate a predicted node alignment 270 between the first AST 220 and the second AST 225. In implementations, the alignment module 265 may constrain candidate node alignments based on nodes in the aligned blocks 260. In this manner, the predicted node alignment 270 may be generated in a hierarchical manner, block by block. For example, the alignment module 265 may use the machine learning model 106 _(1-N) to generate a plurality of node similarity measures between individual nodes of the first AST 220 and nodes of the second AST 225 that are within aligned blocks and then generate the predicted node alignment 270 based on the node similarity measures. This may be true where, for instance, the machine learning model 106 _(1-N) is a GNN. With a GNN, each node similarity measure of the plurality of node similarity measures may be based on a cross-graph attention mechanism (e.g., an attention layer) employed by the GNN. The cross-graph attention mechanism (e.g., attention layer) may provide an attention weight (also referred to as an “attention coefficient”) for each possible pair of nodes that includes a node from the first AST 220 and a node from the second AST 225. Thus, in some implementations, the machine learning model 106 _(1-N) takes the form of a cross-graph attention mechanism employed as part of a GNN.
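The block-by-block constraint can be illustrated with a toy attention computation. The sketch assumes node embeddings are already available (in practice they would come from the GNN), scores each candidate pair with a dot product, and normalizes the scores with a softmax so that they behave like attention weights. The embeddings, the dot-product scoring, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def align_nodes(first_emb, second_emb, aligned_blocks):
    """For each predecessor node, pick the successor node with the highest
    attention weight, considering only nodes in the aligned block."""
    alignment = {}
    for first_nodes, second_nodes in aligned_blocks:
        if not second_nodes:
            continue
        for a in first_nodes:
            weights = softmax([dot(first_emb[a], second_emb[b]) for b in second_nodes])
            best = max(range(len(second_nodes)), key=weights.__getitem__)
            alignment[a] = (second_nodes[best], weights[best])
    return alignment


first_embeddings = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
second_embeddings = {"b1": [0.0, 2.0], "b2": [2.0, 0.0], "b3": [5.0, 0.0]}
# One pair of aligned blocks; b3 belongs to a different block and is never a candidate.
node_alignment = align_nodes(first_embeddings, second_embeddings, [(["a1", "a2"], ["b1", "b2"])])
```

Although `b3` would score highest against `a1` in an unconstrained comparison, it is excluded because it lies outside the aligned block, which is the effect the constraint is meant to have.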

The predicted node alignment 270 may be provided to various downstream module(s) 275 for additional processing. For example, one downstream module 275 may generate a change graph 280. As mentioned previously, change graph 280 may be used for a variety of purposes. For example, a prediction module (not shown) may be configured to process the change graph 280, e.g., using a machine learning model 106 _(1-N) such as a GNN to make a prediction (not shown). These predictions may include, for instance, predicted code changes, predicted code change intents, etc.
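The disclosure does not define the structure of change graph 280. As one hedged illustration of what a downstream module might derive from a node alignment, the sketch below classifies nodes as updated, deleted, or inserted; such records could then serve as the edit annotations of a change graph. The label format and function name are assumptions.

```python
def derive_edits(first_labels, second_labels, alignment):
    """Classify nodes using an alignment {first_node: second_node}.

    first_labels / second_labels: {node_id: label}, e.g. "Name:x".
    Returns (kind, first_node, second_node) tuples; aligned nodes with
    identical labels are treated as unchanged and omitted.
    """
    edits = []
    for node, label in first_labels.items():
        partner = alignment.get(node)
        if partner is None:
            edits.append(("delete", node, None))
        elif second_labels[partner] != label:
            edits.append(("update", node, partner))
    matched = set(alignment.values())
    for node in second_labels:
        if node not in matched:
            edits.append(("insert", None, node))
    return edits


before = {"a1": "Name:x", "a2": "Constant:1", "a3": "Call:log"}
after = {"b1": "Name:y", "b2": "Constant:1", "b4": "Return"}
edits = derive_edits(before, after, {"a1": "b1", "a2": "b2"})
```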

FIG. 3 is a block diagram showing an example process flow 300 in which code knowledge system 102 may generate a synthetic training dataset and use the synthetic training dataset to train one or more machine learning models 106 _(1-N) to make inferences, in accordance with various implementations. Various components depicted in FIG. 3 may be implemented by code knowledge system 102 or separately from code knowledge system 102. These components may be implemented using any combination of hardware and computer-readable instructions.

Beginning at the top left, in implementations, a source code snippet 305 may be processed by a “code-to-AST” component 215 to generate first AST 310. In implementations, the code-to-AST component 215 may also generate second AST 315, e.g., from the first AST 310. The second AST 315 may be a copy of the first AST 310. In other implementations, source code snippet 305 may be converted into other types of graphs.

In implementations, the second AST 315 may be processed by a modifier module 320 to generate, as a synthetic training dataset in conjunction with the first AST 310, a modified second AST 325 and a node alignment 330. The modifier module 320 may generate the modified second AST 325 by changing a field value of each node of a first set of nodes in the second AST 315, deleting each node in a second set of nodes in the second AST 315, and/or changing a parent node of each node in a third set of nodes in the second AST 315. Additionally, in implementations, the modifier module 320 may delete at least one node in the first AST 310. The node alignment 330 generated by the modifier module 320 may indicate a mapping across the first AST 310 and the modified second AST 325 between pairs of similar nodes (e.g., based on the modifications made by the modifier module 320). In implementations, the first AST 310, the modified second AST 325, and the node alignment 330 may be provided to a training module 335, which will be described shortly.
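The following sketch shows how a training pair and its ground-truth alignment might be synthesized with Python's `ast` and `copy` modules. It covers two of the modifications described above (changing a field value and deleting nodes); re-parenting nodes and deleting nodes from the first tree are omitted for brevity. The function name, its parameters, and the choice to track only statement and expression nodes are assumptions.

```python
import ast
import copy

# Only statements and expressions are tracked. Context and operator nodes
# (e.g. Load, Store, Add) may be shared objects, so their ids are not unique.
TRACKED = (ast.stmt, ast.expr)


def make_training_pair(source, renames, delete_index=None):
    """Return (first_ast, modified_second_ast, ground_truth_alignment)."""
    first = ast.parse(source)
    second = copy.deepcopy(first)

    # The copy is structurally identical, so walking both trees in lockstep
    # pairs every node with its own copy before any modification is made.
    pairs = [(a, b) for a, b in zip(ast.walk(first), ast.walk(second))
             if isinstance(a, TRACKED)]

    deleted = set()
    if delete_index is not None:
        # Delete one top-level statement, together with its whole subtree.
        removed = second.body.pop(delete_index)
        deleted = {id(node) for node in ast.walk(removed)}

    for node in ast.walk(second):
        # Change a field value: rename selected identifiers.
        if isinstance(node, ast.Name) and node.id in renames:
            node.id = renames[node.id]

    alignment = [(a, b) for a, b in pairs if id(b) not in deleted]
    return first, second, alignment


first_ast, second_ast, truth = make_training_pair(
    "x = 1\nprint(x)\nx = x + 2\n", {"x": "total"}, delete_index=1
)
```

Because the alignment is recorded before the copy is modified, it remains correct by construction, which is what makes it usable as a ground truth label.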

In implementations, the first AST 310 and the modified second AST 325 may be applied by an alignment module 265 as inputs across a machine learning model 106 _(1-N) (e.g., a GNN). Additionally, in implementations, aligned blocks (not shown), e.g., generated in the manner described with respect to the aligned blocks 260 of FIG. 2, may also be applied by the alignment module 265 as inputs across the machine learning model 106 _(1-N). The alignment module 265 may then use the machine learning model 106 _(1-N) to generate a predicted node alignment 340 between the first AST 310 and the modified second AST 325, e.g., as described with respect to FIG. 2.

In implementations, the predicted node alignment 340 may be provided to the training module 335. Training module 335 may then perform a comparison of the predicted node alignment 340 and the node alignment 330 (e.g., a ground truth node alignment) and may train the machine learning model 106 _(1-N) based on the comparison.
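The disclosure does not state which comparison the training module performs. One common choice, shown here purely as an assumption, is the average negative log-likelihood that the model's attention weights assign to the ground-truth partner of each node; a training loop would adjust the model's parameters to reduce this value.

```python
import math


def alignment_loss(predicted_weights, ground_truth, floor=1e-9):
    """Average negative log-likelihood of the true partner of each node.

    predicted_weights: {first_node: {second_node: attention weight}}
    ground_truth: {first_node: second_node}
    """
    if not ground_truth:
        return 0.0
    total = 0.0
    for first_node, second_node in ground_truth.items():
        weight = predicted_weights.get(first_node, {}).get(second_node, 0.0)
        # Clamp to avoid log(0) when the model gives the true pair no weight.
        total += -math.log(max(weight, floor))
    return total / len(ground_truth)


uncertain = alignment_loss({"a1": {"b1": 0.5, "b2": 0.5}}, {"a1": "b1"})
confident = alignment_loss({"a1": {"b1": 1.0, "b2": 0.0}}, {"a1": "b1"})
```

The loss falls from about 0.69 to 0.0 as the model moves from an even split to full confidence in the correct pair.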

FIG. 4 is a flowchart illustrating an example method 400 of using one or more machine learning models 106 _(1-N) trained using techniques described herein to make inferences, in accordance with implementations disclosed herein. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of the code knowledge system 102. Moreover, while operations of method 400 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added.

At block 410, the system may obtain a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet. In implementations, at block 410, the code knowledge system 102 may obtain the first abstract syntax tree (e.g., 220) that represents the predecessor source code snippet (e.g., 205) and the second abstract syntax tree (e.g., 225) that represents the successor source code snippet (e.g., 210).

Still referring to FIG. 4, at block 420, the system may determine a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes. In implementations, at block 420, the code knowledge system 102 may determine the mapping (e.g., 235) across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes. In implementations, the code knowledge system 102 may use a tree-based code differencing algorithm to determine the mapping.

Still referring to FIG. 4, at block 430, the system may obtain a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet. In implementations, at block 430, the code knowledge system 102 may obtain the first control-flow graph (e.g., 245) that represents the predecessor source code snippet and the second control-flow graph (e.g., 250) that represents the successor source code snippet. In implementations, the code knowledge system 102 may use a parser to generate the first control-flow graph from the predecessor source code snippet and the second control-flow graph from the successor source code snippet.

Still referring to FIG. 4, at block 440, the system may align blocks inthe first control-flow graph with blocks in the second control-flowgraph based on the mapping across the first abstract syntax tree and thesecond abstract syntax tree. In implementations, at block 440, the codeknowledge system 102 may align blocks in the first control-flow graphwith blocks in the second control-flow graph based on the mapping acrossthe first abstract syntax tree and the second abstract syntax tree togenerate aligned blocks (e.g., 260). In implementations, the codeknowledge system 102 may determine a mapping across the firstcontrol-flow graph and the second control-flow graph between pairs ofsimilar blocks identified using the mapping across the first abstractsyntax tree and the second abstract syntax tree and then use the mappingacross the first control-flow graph and the second control-flow graph toalign the blocks in the first control-flow graph with the blocks in thesecond control-flow graph.

Still referring to FIG. 4, at block 450, the system may apply thealigned blocks as inputs across a trained machine learning model togenerate an alignment of nodes in the first abstract syntax tree withnodes in the second abstract syntax tree. In implementations, at block450, the code knowledge system 102 may apply the aligned blocks asinputs across the trained machine learning model (e.g., 106 _(1-N)) togenerate an alignment of nodes (e.g., 270) in the first abstract syntaxtree with nodes in the second abstract syntax tree. In implementations,the machine learning model 106 _(1-N) may be a GNN. In otherimplementations, the machine learning model 106 _(1-N) may be a GMN orany other type of machine learning model. In implementations, the codeknowledge system 102 may constrain candidate node alignments based onnodes in the aligned blocks.

Still referring to FIG. 4, at block 460, the system may generate a change graph based on the alignment of the nodes in the first abstract syntax tree with the nodes in the second abstract syntax tree. In implementations, at block 460, the code knowledge system 102 may generate the change graph (e.g., 280) based on the alignment of the nodes in the first abstract syntax tree with the nodes in the second abstract syntax tree. The change graph may represent one or more edits made to the predecessor source code snippet to yield the successor source code snippet.
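For illustration only, the edits represented by a change graph might be derived from a node alignment along the following lines. This simplified sketch emits a flat list of edit operations rather than a graph, and it assumes that each tree is available as a dictionary of node identifier to label; the operation names and the function `change_ops` are hypothetical.

```python
def change_ops(pred_nodes, succ_nodes, alignment):
    """Derive edit operations from a node alignment.

    pred_nodes / succ_nodes: dict of node id -> label.
    alignment: dict of predecessor node id -> successor node id.
    Returns a list of (operation, pred node id, succ node id) tuples.
    """
    ops = []
    for p, label in pred_nodes.items():
        if p not in alignment:
            # No counterpart in the successor: the node was removed.
            ops.append(("delete", p, None))
        elif succ_nodes[alignment[p]] != label:
            # Aligned, but the label differs: the node was modified.
            ops.append(("update", p, alignment[p]))
    matched = set(alignment.values())
    for s in succ_nodes:
        if s not in matched:
            # No counterpart in the predecessor: the node was added.
            ops.append(("insert", None, s))
    return ops


pred_nodes = {1: "x", 2: "y", 3: "z"}
succ_nodes = {10: "x", 11: "w", 12: "q"}
ops = change_ops(pred_nodes, succ_nodes, {1: 10, 2: 11})
```

In this example, node 2 is reported as updated, node 3 as deleted, and node 12 as inserted, while the unchanged pair (1, 10) produces no operation.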

FIG. 5 is a flowchart illustrating an example method 500 of generating a synthetic training dataset and using the synthetic training dataset to train one or more machine learning models 106 _(1-N) to make inferences, in accordance with implementations disclosed herein. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of the code knowledge system 102. Moreover, while operations of method 500 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added.

At block 510, the system may obtain a first abstract syntax tree that represents a source code snippet. In implementations, at block 510, the code knowledge system 102 may obtain the first abstract syntax tree (e.g., 310) that represents a source code snippet (e.g., 305).

Still referring to FIG. 5, at block 520, the system may generate a second abstract syntax tree from the first abstract syntax tree. In implementations, at block 520, the code knowledge system 102 may generate the second abstract syntax tree (e.g., 315) from the first abstract syntax tree.

Still referring to FIG. 5, at block 530, for each node in a first set of nodes in the second abstract syntax tree, the system may change a field value of the node. In implementations, at block 530, the code knowledge system 102 may, for each node in the first set of nodes in the second abstract syntax tree, change a field value of the node to generate a modified second abstract syntax tree (e.g., 325).

Still referring to FIG. 5, at block 540, the system may delete each node in a second set of nodes in the second abstract syntax tree. In implementations, at block 540, the code knowledge system 102 may delete each node in the second set of nodes in the second abstract syntax tree to generate a modified second abstract syntax tree. In other implementations, at block 540, the code knowledge system 102 may delete each node in the second set of nodes in the modified second abstract syntax tree to further modify the modified second abstract syntax tree.

Still referring to FIG. 5, at block 550, for each node in a third set of nodes in the second abstract syntax tree, the system may change a parent node of the node. In implementations, at block 550, the code knowledge system 102 may, for each node in the third set of nodes in the second abstract syntax tree, change a parent node of the node to generate a modified second abstract syntax tree. In other implementations, at block 550, the code knowledge system 102 may, for each node in the third set of nodes in the modified second abstract syntax tree, change a parent node of the node to further modify the modified second abstract syntax tree.
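As a non-limiting illustration, blocks 520 through 550 might be sketched as shown below. The sketch assumes a flat tree representation (a dictionary of node identifier to a value and a parent identifier), assumes that children of a deleted node are re-attached to the deleted node's parent, and assumes that surviving nodes keep their identifiers so that an identity mapping can serve as a known-correct alignment. These choices, and the function name `make_synthetic_pair`, are hypothetical.

```python
import copy


def make_synthetic_pair(tree, rename, delete, reparent):
    """Derive a modified second tree from a first tree.

    tree: dict of node id -> {"value": str, "parent": node id or None}.
    rename: dict of node id -> new field value (first set of nodes).
    delete: set of node ids to remove (second set of nodes).
    reparent: dict of node id -> new parent id (third set of nodes).
    The three sets are assumed to be disjoint, and reparenting is assumed
    not to introduce cycles; neither is checked here.
    Returns (second_tree, known_alignment).
    """
    # Block 520: the second tree starts as a copy of the first.
    second = copy.deepcopy(tree)

    # Block 530: change a field value of each node in the first set.
    for nid, value in rename.items():
        second[nid]["value"] = value

    # Block 540: delete each node in the second set, hoisting its
    # children to the deleted node's parent (an assumption).
    for nid in delete:
        parent = second[nid]["parent"]
        for node in second.values():
            if node["parent"] == nid:
                node["parent"] = parent
        del second[nid]

    # Block 550: change the parent of each node in the third set.
    for nid, new_parent in reparent.items():
        second[nid]["parent"] = new_parent

    # Surviving nodes keep their ids, so the identity mapping is correct.
    known_alignment = {nid: nid for nid in tree if nid in second}
    return second, known_alignment


first = {
    0: {"value": "Module", "parent": None},
    1: {"value": "Assign", "parent": 0},
    2: {"value": "x", "parent": 1},
    3: {"value": "Call", "parent": 1},
    4: {"value": "f", "parent": 3},
}
second, known_alignment = make_synthetic_pair(
    first, rename={2: "y"}, delete={3}, reparent={4: 0}
)
```

Because the first tree is copied before any modification, it remains available unchanged as the other half of the training pair.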

Still referring to FIG. 5, at block 560, the system may delete at least one node in the first abstract syntax tree. In implementations, at block 560, the code knowledge system 102 may delete at least one node in the first abstract syntax tree.

Still referring to FIG. 5, at block 570, the system may train a machine learning model to generate an alignment of nodes based on the first abstract syntax tree and the second abstract syntax tree. In implementations, at block 570, the code knowledge system 102 may train a machine learning model (e.g., 106 _(1-N)) to generate an alignment of nodes (e.g., 340) based on the first abstract syntax tree and the modified second abstract syntax tree. In implementations, the machine learning model 106 _(1-N) may be a GNN. In other implementations, the machine learning model 106 _(1-N) may be a GMN or any other type of machine learning model.

FIG. 6 is a block diagram of an example computing device 610 that may optionally be utilized to perform one or more aspects of techniques described herein. Computing device 610 typically includes at least one processor 614 which communicates with a number of peripheral devices via bus subsystem 612. These peripheral devices may include a storage subsystem 624, including, for example, a memory subsystem 625 and a file storage subsystem 626, user interface output devices 620, user interface input devices 622, and a network interface subsystem 616. The input and output devices allow user interaction with computing device 610. Network interface subsystem 616 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.

User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.

User interface output devices 620 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 610 to the user or to another machine or computing device.

Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 624 may include the logic to perform selected aspects of the process flows of FIGS. 2 and 3 and the methods of FIGS. 4 and 5, as well as to implement various components depicted in FIG. 1.

These software modules are generally executed by processor 614 alone or in combination with other processors. The memory subsystem 625 included in the storage subsystem 624 can include a number of memories including a main random access memory (RAM) 630 for storage of instructions and data during program execution and a read only memory (ROM) 632 in which fixed instructions are stored. A file storage subsystem 626 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 626 in the storage subsystem 624, or in other machines accessible by the processor(s) 614.

Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.

Computing device 610 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible having more or fewer components than the computing device depicted in FIG. 6.

In other implementations, a method implemented by one or more processors may include: obtaining a first abstract syntax tree that represents a source code snippet; generating a second abstract syntax tree from the first abstract syntax tree; for each node in a first set of nodes in the second abstract syntax tree, changing a field value of the node; deleting each node in a second set of nodes in the second abstract syntax tree; and for each node in a third set of nodes in the second abstract syntax tree, changing a parent node of the node.

In some implementations, the method may further include deleting at least one node in the first abstract syntax tree. In some implementations, the method may further include training a machine learning model to generate an alignment of nodes based on the first abstract syntax tree and the second abstract syntax tree. In some implementations, the training the machine learning model may include: generating a predicted node alignment between the first abstract syntax tree and the second abstract syntax tree; comparing the predicted node alignment and a ground truth node alignment; and training the machine learning model based on the comparing.
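By way of example only, the comparing of a predicted node alignment with a ground truth node alignment might be expressed as a loss value that a training procedure then seeks to reduce. The sketch below assumes that the model emits a raw score for each candidate pair of nodes and uses a softmax cross-entropy over each predecessor node's candidates; this particular loss, and the function names, are assumptions made for illustration and are not specified by the implementations described herein.

```python
import math


def alignment_loss(scores, ground_truth):
    """Mean cross-entropy between predicted scores and the true alignment.

    scores: dict of pred node -> dict of succ node -> raw model score.
    ground_truth: dict of pred node -> correct succ node.
    """
    total = 0.0
    for p, target in ground_truth.items():
        row = scores[p]
        # Numerically stable log-sum-exp over this node's candidates.
        m = max(row.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in row.values()))
        total += log_z - row[target]
    return total / len(ground_truth)


def alignment_accuracy(scores, ground_truth):
    """Fraction of nodes whose top-scoring candidate is the correct one."""
    hits = sum(
        1 for p, target in ground_truth.items()
        if max(scores[p], key=scores[p].get) == target
    )
    return hits / len(ground_truth)


scores = {1: {10: 2.0, 11: 0.0}, 2: {10: 0.0, 11: 2.0}}
loss_when_correct = alignment_loss(scores, {1: 10, 2: 11})
loss_when_wrong = alignment_loss(scores, {1: 11, 2: 10})
```

The loss is lower when the highest-scoring candidates agree with the ground truth, which is the signal a gradient-based training procedure could use to update the model.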

In other implementations, a computer program product may include one or more non-transitory computer-readable storage media having program instructions collectively stored on the one or more non-transitory computer-readable storage media. The program instructions may be executable to: obtain a first abstract syntax tree that represents a source code snippet; generate a second abstract syntax tree from the first abstract syntax tree; for each node in a first set of nodes in the second abstract syntax tree, change a field value of the node; delete each node in a second set of nodes in the second abstract syntax tree; and for each node in a third set of nodes in the second abstract syntax tree, change a parent node of the node.

In some implementations, the program instructions may be further executable to delete at least one node in the first abstract syntax tree. In some implementations, the program instructions may be further executable to train a machine learning model to generate an alignment of nodes based on the first abstract syntax tree and the second abstract syntax tree. In some implementations, the training the machine learning model may include: generating a predicted node alignment between the first abstract syntax tree and the second abstract syntax tree; comparing the predicted node alignment and a ground truth node alignment; and training the machine learning model based on the comparing.

In other implementations, a system may include: a processor, a computer-readable memory, one or more computer-readable storage media, and program instructions collectively stored on the one or more computer-readable storage media. The program instructions may be executable to: obtain a first abstract syntax tree that represents a source code snippet; generate a second abstract syntax tree from the first abstract syntax tree; for each node in a first set of nodes in the second abstract syntax tree, change a field value of the node; delete each node in a second set of nodes in the second abstract syntax tree; and for each node in a third set of nodes in the second abstract syntax tree, change a parent node of the node.

In some implementations, the program instructions may be further executable to delete at least one node in the first abstract syntax tree. In some implementations, the program instructions may be further executable to train a machine learning model to generate an alignment of nodes based on the first abstract syntax tree and the second abstract syntax tree. In some implementations, the training the machine learning model may include: generating a predicted node alignment between the first abstract syntax tree and the second abstract syntax tree; comparing the predicted node alignment and a ground truth node alignment; and training the machine learning model based on the comparing.

Implementations may address problems with the limited accuracy of conventional heuristics for determining an alignment between matching/corresponding nodes of predecessor and successor graphs in order to generate a change graph. In particular, some implementations may improve the functioning of a computer by providing methods and systems for training a machine learning model and using the trained machine learning model to more accurately determine alignments between matching/corresponding nodes of predecessor and successor graphs representing predecessor and successor source code snippets. Accordingly, through the use of rules that improve computer-related technology, implementations allow computer performance of functions not previously performable by a computer. Additionally, implementations use techniques that are, by definition, rooted in computer technology (e.g., machine learning models, GNNs, GMNs, etc.).

While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.

What is claimed is:
 1. A method implemented by one or more processors, the method comprising: obtaining a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determining a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes; obtaining a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; aligning blocks in the first control-flow graph with blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree; and applying the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.
 2. The method according to claim 1, wherein the trained machine learning model comprises a graph neural network.
 3. The method according to claim 1, wherein the determining the mapping across the first abstract syntax tree and the second abstract syntax tree comprises using a tree-based code differencing algorithm.
 4. The method according to claim 1, wherein the obtaining the first control-flow graph and the second control-flow graph comprises using a parser to generate the first control-flow graph from the predecessor source code snippet and the second control-flow graph from the successor source code snippet.
 5. The method according to claim 1, wherein the aligning the blocks in the first control-flow graph with the blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree comprises: determining a mapping across the first control-flow graph and the second control-flow graph between pairs of similar blocks identified using the mapping across the first abstract syntax tree and the second abstract syntax tree; and using the mapping across the first control-flow graph and the second control-flow graph to align the blocks in the first control-flow graph with the blocks in the second control-flow graph.
 6. The method according to claim 1, wherein in the applying the aligned blocks as inputs across the trained machine learning model to generate the alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree, candidate node alignments are constrained based on nodes in the aligned blocks.
 7. The method according to claim 1, further comprising generating a change graph based on the alignment of the nodes in the first abstract syntax tree with the nodes in the second abstract syntax tree, wherein the change graph represents one or more edits made to the predecessor source code snippet to yield the successor source code snippet.
 8. A computer program product comprising one or more non-transitory computer-readable storage media having program instructions collectively stored on the one or more non-transitory computer-readable storage media, the program instructions executable to: obtain a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determine a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes; obtain a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; align blocks in the first control-flow graph with blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree; and apply the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.
 9. The computer program product according to claim 8, wherein the trained machine learning model comprises a graph neural network.
 10. The computer program product according to claim 8, wherein the determining the mapping across the first abstract syntax tree and the second abstract syntax tree comprises using a tree-based code differencing algorithm.
 11. The computer program product according to claim 8, wherein the obtaining the first control-flow graph and the second control-flow graph comprises using a parser to generate the first control-flow graph from the predecessor source code snippet and the second control-flow graph from the successor source code snippet.
 12. The computer program product according to claim 8, wherein the aligning the blocks in the first control-flow graph with the blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree comprises: determining a mapping across the first control-flow graph and the second control-flow graph between pairs of similar blocks identified using the mapping across the first abstract syntax tree and the second abstract syntax tree; and using the mapping across the first control-flow graph and the second control-flow graph to align the blocks in the first control-flow graph with the blocks in the second control-flow graph.
 13. The computer program product according to claim 8, wherein in the applying the aligned blocks as inputs across the trained machine learning model to generate the alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree, candidate node alignments are constrained based on nodes in the aligned blocks.
 14. The computer program product according to claim 8, wherein the program instructions are further executable to generate a change graph based on the alignment of the nodes in the first abstract syntax tree with the nodes in the second abstract syntax tree, wherein the change graph represents one or more edits made to the predecessor source code snippet to yield the successor source code snippet.
 15. A system comprising: a processor, a computer-readable memory, one or more computer-readable storage media, and program instructions collectively stored on the one or more computer-readable storage media, the program instructions executable to: obtain a first abstract syntax tree that represents a predecessor source code snippet and a second abstract syntax tree that represents a successor source code snippet; determine a mapping across the first abstract syntax tree and the second abstract syntax tree between pairs of matching nodes; obtain a first control-flow graph that represents the predecessor source code snippet and a second control-flow graph that represents the successor source code snippet; align blocks in the first control-flow graph with blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree; and apply the aligned blocks as inputs across a trained machine learning model to generate an alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree.
 16. The system according to claim 15, wherein the trained machine learning model comprises a graph neural network.
 17. The system according to claim 15, wherein the determining the mapping across the first abstract syntax tree and the second abstract syntax tree comprises using a tree-based code differencing algorithm.
 18. The system according to claim 15, wherein the obtaining the first control-flow graph and the second control-flow graph comprises using a parser to generate the first control-flow graph from the predecessor source code snippet and the second control-flow graph from the successor source code snippet.
 19. The system according to claim 15, wherein the aligning the blocks in the first control-flow graph with the blocks in the second control-flow graph based on the mapping across the first abstract syntax tree and the second abstract syntax tree comprises: determining a mapping across the first control-flow graph and the second control-flow graph between pairs of similar blocks identified using the mapping across the first abstract syntax tree and the second abstract syntax tree; and using the mapping across the first control-flow graph and the second control-flow graph to align the blocks in the first control-flow graph with the blocks in the second control-flow graph.
 20. The system according to claim 15, wherein in the applying the aligned blocks as inputs across the trained machine learning model to generate the alignment of nodes in the first abstract syntax tree with nodes in the second abstract syntax tree, candidate node alignments are constrained based on nodes in the aligned blocks.